First Steps with Stan

Maud goes to Vegas

Andrew B. Collier

andrew@exegetic.biz | DataWookie | DataWookie

eRum (Budapest)

15 May 2018 at 16:00

Meet Maud

description  symbols   payout
(no win)                    0
1x mouse     🐭             1
1x cat       🐱             2
2x mouse     🐭🐭           5
2x cat       🐱🐱          10
3x mouse     🐭🐭🐭        20
3x cat       🐱🐱🐱       100

Maud’s Burning Questions

  • What is the hit rate?
  • What is the RTP?
  • What is the hit rate for each winning combination?

Maud’s Data (by Session)

sessions
# A tibble: 100 x 8
   session               details spins  hits wager payout  hit_rate       rtp
     <int>                <list> <int> <int> <int>  <dbl>     <dbl>     <dbl>
 1       1  <data.frame [7 x 3]>     7     2    10      3 0.2857143 0.3000000
 2       2 <data.frame [19 x 3]>    19     7    30     29 0.3684211 0.9666667
 3       3 <data.frame [19 x 3]>    19     3    22      3 0.1578947 0.1363636
 4       4 <data.frame [26 x 3]>    26     7    30     13 0.2692308 0.4333333
 5       5 <data.frame [23 x 3]>    23     8    31     35 0.3478261 1.1290323
 6       6 <data.frame [20 x 3]>    20     8    26     12 0.4000000 0.4615385
 7       7 <data.frame [22 x 3]>    22     5    30     20 0.2272727 0.6666667
 8       8 <data.frame [22 x 3]>    22     4    25     10 0.1818182 0.4000000
 9       9 <data.frame [18 x 3]>    18     4    26      6 0.2222222 0.2307692
10      10 <data.frame [26 x 3]>    26     8    33     75 0.3076923 2.2727273
# ... with 90 more rows
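The derived columns are simple ratios: hit_rate is hits over spins and rtp (return to player) is payout over wager. A self-contained sketch, using made-up rows that mirror the first two sessions above:

```r
library(dplyr)

# Toy stand-in for the first two sessions shown above.
sessions <- tibble(
  session = 1:2,
  spins   = c(7L, 19L),
  hits    = c(2L, 7L),
  wager   = c(10L, 30L),
  payout  = c(3, 29)
)

sessions <- sessions %>%
  mutate(
    hit_rate = hits / spins,   # fraction of spins that pay out
    rtp      = payout / wager  # return to player: payout relative to stake
  )
```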

Maud’s Data (by Spin)

spin <- sessions %>%
  select(session, details) %>%
  unnest() %>%
  mutate(success = as.integer(payout > 0))
# A tibble: 1,972 x 5
   session  spin wager payout success
     <int> <int> <int>  <dbl>   <int>
 1       1     1     1      1       1
 2       1     2     1      0       0
 3       1     3     1      0       0
 4       1     4     3      0       0
 5       1     5     2      0       0
 6       1     6     1      2       1
 7       1     7     1      0       0
 8       2     1     2      0       0
 9       2     2     2      0       0
10       2     3     1      1       1
# ... with 1,962 more rows

Q1: Hit Rate

Probability distribution for binomial process:

\[ P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n - k} \]

where

  • \(k\) successes in \(n\) trials; and
  • the probability of success on any trial is \(\theta\).

Multiple experiments: for session \(i\) there are \(k_i\) hits from \(n_i\) spins.

Probability distribution for Bernoulli process:

\[ P(k | \theta) = \theta^k (1 - \theta)^{1 - k} \]

where

  • \(k = 0\) indicates failure;
  • \(k = 1\) indicates success; and
  • the probability of success on any trial is \(\theta\).
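Both densities agree with R's built-ins; a quick check, with values chosen to match session 1 above (2 hits in 7 spins):

```r
theta <- 0.3
n <- 7; k <- 2

# Binomial pmf written out as above versus dbinom().
manual <- choose(n, k) * theta^k * (1 - theta)^(n - k)
stopifnot(all.equal(manual, dbinom(k, n, theta)))

# The Bernoulli pmf is the n = 1 special case.
stopifnot(all.equal(theta^1 * (1 - theta)^0, dbinom(1, 1, theta)))
```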

Bayes’ Theorem

\[ p(\theta|y, X) = \frac{p(y|X, \theta) p(\theta)}{p(y)} \propto p(y|X, \theta) p(\theta) \] where

  • \(y\) are observations;
  • \(X\) are predictors;
  • prior — \(p(\theta)\);
  • likelihood — \(p(y|X, \theta)\); and
  • posterior — \(p(\theta|y, X)\).

Stan: a probabilistic programming language for full Bayesian inference.

RStan: the R interface to Stan, provided by the rstan package.

Stan Workflow

  1. Choose a model.
  2. Write Stan program (likelihood and priors).
  3. Stan parser converts this to C++.
  4. Compile C++.
  5. Execute compiled binary.
  6. Evaluate results. Optionally return to 2.
  7. Inference based on posterior sample.

To use rstan you need two files: a .stan file containing the model and a .R script to prepare the data, run the sampler and process the results.

Stan Skeleton

data {
  int<lower = 0> N;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  real<lower = 0, upper = 1> theta;
}
model {
  y ~ bernoulli(theta);  
}

library(rstan)

# Use all available cores.
#
options(mc.cores = parallel::detectCores())

trials <- list(
  N       = nrow(spin),
  y       = spin$success
)

fit <- stan(
  file    = "bernoulli.stan",
  data    = trials,
  chains  = 2,                         # Number of Markov chains
  warmup  = 500,                       # Warmup iterations per chain
  iter    = 1000                       # Total iterations per chain
)

class(fit)
[1] "stanfit"
attr(,"package")
[1] "rstan"
samples <- as.data.frame(fit)

head(samples)
      theta      lp__
1 0.3095256 -1226.968
2 0.3072672 -1227.066
3 0.3124851 -1226.912
4 0.3200480 -1227.132
5 0.3140362 -1226.914
6 0.2929604 -1228.812
nrow(samples)
[1] 1000
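Posterior summaries can be computed directly from the draws, e.g. quantile(samples$theta, c(0.025, 0.5, 0.975)). The sketch below fakes the draws with rbeta() so that it runs stand-alone: Beta(618, 1356) approximates the posterior for the totals implied by Maud's data (617 hits in 1,972 spins) under a flat prior. With a real fit, samples comes from as.data.frame(fit) as above.

```r
set.seed(42)

# Stand-in for posterior draws of theta.
samples <- data.frame(theta = rbeta(1000, 618, 1356))

quantile(samples$theta, probs = c(0.025, 0.5, 0.975))
```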

print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: bernoulli.
2 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1000.

          mean se_mean   sd     2.5%      50%    97.5% n_eff Rhat
theta     0.31    0.00 0.01     0.29     0.31     0.33   413    1
lp__  -1227.43    0.04 0.70 -1229.56 -1227.14 -1226.91   348    1

Samples were drawn using NUTS(diag_e) at Tue May 15 17:30:35 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Use summary() to get information for each chain.

The same hit rate can also be modelled at the session level: a binomial likelihood with an explicit Beta prior on \(\theta\).

data {
  int<lower=0> N;
  int hits[N];
  int spins[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  hits ~ binomial(spins, theta);       // Likelihood
  theta ~ beta(2, 2);                  // Prior
}

print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: binomial-beta-prior.
2 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1000.

          mean se_mean   sd     2.5%      50%    97.5% n_eff Rhat
theta     0.31    0.00 0.01     0.29     0.31     0.33   280    1
lp__  -1228.96    0.03 0.75 -1231.15 -1228.66 -1228.45   545    1

Samples were drawn using NUTS(diag_e) at Tue May 15 17:31:38 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Q2: RTP

data {
  int<lower=0> N;
  real<lower=0> rtp[N];                // Lognormal support requires positive RTP
}
parameters {
  real<lower = 0> mu;
  real<lower = 0> sigma;
}
model {
  rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
}
generated quantities {
  real simulated[25];
  for (i in 1:25) {
    simulated[i] = lognormal_rng(log(mu) - sigma * sigma / 2, sigma);
  }
}
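The location parameter log(mu) - sigma^2 / 2 is chosen so that mu is the mean of the lognormal, since a lognormal with log-scale location \(m\) and scale \(\sigma\) has mean \(\exp(m + \sigma^2/2)\). A quick numerical check (0.8 and 0.72 are roughly the posterior means reported below):

```r
set.seed(1)

mu <- 0.8; sigma <- 0.72

# Simulate from the same parameterisation used in the model block.
x <- rlnorm(1e6, meanlog = log(mu) - sigma^2 / 2, sdlog = sigma)

mean(x)   # close to mu = 0.8
```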

Inference for Stan model: lognormal-rtp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
mu      0.80    0.00 0.07   0.69   0.76   0.80   0.84   0.94  1901    1
sigma   0.72    0.00 0.05   0.62   0.68   0.71   0.75   0.83  1852    1
lp__  -16.12    0.02 1.01 -18.77 -16.48 -15.81 -15.41 -15.16  1759    1

Samples were drawn using NUTS(diag_e) at Tue May 15 17:32:48 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Here’s the Kicker

What’s the probability of breaking even?

simulated <- extract(fit)$simulated

mean(simulated > 1)
[1] 0.25117

What’s the probability of doubling your money?

mean(simulated > 2)
[1] 0.05319

Q3: Hit Rate per Combination

description  symbols   payout
(no win)                    0
1x mouse     🐭             1
1x cat       🐱             2
2x mouse     🐭🐭           5
2x cat       🐱🐱          10
3x mouse     🐭🐭🐭        20
3x cat       🐱🐱🐱       100
data {
  int<lower=0> N;
  real<lower=0> rtp[N];                // Lognormal support requires positive RTP
}
parameters {
  real<lower=0, upper=1> theta[6];
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> mu;
  mu = theta[1] * 1 +                  // Payline 1 -> 1x
       theta[2] * 2 +                  // Payline 2 -> 2x
       theta[3] * 5 +                  // Payline 3 -> 5x
       theta[4] * 10 +                 // Payline 4 -> 10x
       theta[5] * 20 +                 // Payline 5 -> 20x
       theta[6] * 100;                 // Payline 6 -> 100x
}
model {
  rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
  theta[1] ~ beta(2, 2);               // Mode = 0.5
  theta[2] ~ beta(2, 4);               // Mode = 0.25
  theta[3] ~ beta(2, 5);               // Mode = 0.2
  theta[4] ~ beta(2, 8);               // Mode = 0.125
  theta[5] ~ beta(2, 10);              // Mode = 0.1
  theta[6] ~ beta(2, 16);              // Mode = 0.0625
}

                mean      se_mean           sd         2.5%       97.5%    n_eff      Rhat
theta[1] 0.140072477 1.445106e-03 0.0886783678 0.0186286589 0.352326667 3765.615 0.9993099
theta[2] 0.068064522 6.711297e-04 0.0424459690 0.0102927430 0.168731470 4000.000 1.0001725
theta[3] 0.028892776 2.894141e-04 0.0183041579 0.0033953293 0.072914250 4000.000 0.9998855
theta[4] 0.014594931 1.437237e-04 0.0090898869 0.0019018620 0.036523596 4000.000 0.9999596
theta[5] 0.007441395 7.525103e-05 0.0047592927 0.0009537422 0.019112895 4000.000 1.0000193
theta[6] 0.001517744 1.482361e-05 0.0009375275 0.0002257116 0.003729426 4000.000 1.0005894

The 1x payout is triggered on average every 7 spins.

The 100x payout is triggered on average only every 659 spins.
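These spins-per-trigger figures are simply reciprocals of the posterior mean hit rates:

```r
# Posterior means for theta[1] (1x) and theta[6] (100x) from the table above.
round(1 / 0.140072477)    # 7: the 1x payout hits about once every 7 spins
round(1 / 0.001517744)    # 659: the 100x payout about once every 659 spins
```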

Exit Maud!

Why Stan?

  • Extract maximum information from your data.
  • Learning curve is not as steep as you might think.

Do
cool things
with
Stan!

Extra Stuff

Metropolis-Hastings Algorithm

  1. Randomly sample \(\theta^{(0)}\).
  2. Set \(i = 1\).
  3. Randomly sample proposal \(\theta^{*}\) in the vicinity of \(\theta^{(i-1)}\).
  4. Sample \(u\) uniformly on \([0, 1)\).

\[ \theta^{(i)} = \left\{\begin{array}{ll} \theta^{*} & \text{if } u \cdot p(\theta^{(i-1)}|y, X) < p(\theta^{*}|y, X) \\ \theta^{(i-1)} & \text{otherwise.} \end{array}\right. \]

  5. Increment \(i\).
  6. Return to step 3.
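A minimal R sketch of this random-walk algorithm (illustrative only, not the deck's code; the target is the binomial posterior for Maud's overall hit rate under a flat prior, and the step size 0.02 is an arbitrary choice):

```r
set.seed(13)

# Unnormalised log posterior: binomial likelihood with a flat prior,
# using the totals implied by Maud's data (617 hits in 1972 spins).
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)
  617 * log(theta) + 1355 * log(1 - theta)
}

M <- 5000
theta <- numeric(M)
theta[1] <- 0.5                                     # step 1: initial value
for (i in 2:M) {
  proposal <- theta[i - 1] + rnorm(1, sd = 0.02)    # step 3: local proposal
  u <- runif(1)                                     # step 4
  accept <- log(u) < log_post(proposal) - log_post(theta[i - 1])
  theta[i] <- if (accept) proposal else theta[i - 1]
}

mean(theta[-(1:1000)])   # roughly 0.31 after discarding warmup
```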

Poisson

# trials <- list(
#   N       = nrow(sessions),
#   spins   = sessions$spins
# )
# 
# fit <- stan(
#   file    = "poisson.stan",
#   data    = trials,
#   chains  = 4,
#   warmup  = 1000,
#   iter    = 2000
# )
# extract(fit)

# hist(extract(fit)$lambda)
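The poisson.stan file itself is not shown in the deck; a minimal version consistent with the driver above might look like this (sketch):

```stan
data {
  int<lower=0> N;
  int<lower=0> spins[N];
}
parameters {
  real<lower=0> lambda;                // Mean spins per session
}
model {
  spins ~ poisson(lambda);             // Likelihood
}
```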

Regression

# ggplot(sessions, aes(spins, wager)) + geom_jitter()

Regression Stan

data {
  int<lower=0> N;
  vector[N] y;                         // Total wager
  vector[N] x;                         // Number of spins (standardised)
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma); // Likelihood
  //
  alpha ~ normal(0, 10);               // Prior (Normal)
  beta ~ normal(0, 10);                // Prior (Normal)
  sigma ~ cauchy(0, 5);                // Prior (Cauchy)
}
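As a sanity check on a fitted model of this form, lm() gives the maximum-likelihood line. Here is a self-contained sketch on simulated data (the true values 25 and 6 are invented for illustration, not Maud's numbers):

```r
set.seed(7)

# Simulated stand-in for sessions: wager grows with (standardised) spins.
spins <- rpois(100, 20)
x <- scale(spins)[, 1]              # standardised predictor, as in the model
y <- 25 + 6 * x + rnorm(100, sd = 3)

coef(lm(y ~ x))                     # estimates should be near 25 and 6
```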

Regression Stan R

# summary(fit)

“… the theory of inverse probability is founded upon error and must be wholly rejected.”

Sir Ronald Fisher (1925)

\[ \text{hit rate} = \theta^* = \frac{\text{total hits}}{\text{total spins}} = \frac{\sum_i k_i}{\sum_i n_i} \]

with(sessions, sum(hits) / sum(spins))
[1] 0.3128803

\[ \text{hit rate for session } i = \theta^*_i = \frac{k_i}{n_i} \]

# A tibble: 1 x 2
  session_avg session_std
        <dbl>       <dbl>
1   0.3100018   0.1030164

The 95% confidence interval for the mean session hit rate extends from 28.9% to 33.1%.

Assuming that sessions \(i = 1, 2, 3, \ldots\) are independent:

\[ P(k | n, \theta) = \prod_i P(k_i | n_i, \theta) = L(\theta; n, k) \]

Log-likelihood for binomial process (multiple sessions, dropping the constant binomial coefficient):

\[ LL(\theta; n, k) = \sum_i \left[ k_i \log{\theta} + (n_i - k_i) \log{(1 - \theta)} \right] \]

# A tibble: 6 x 2
  theta log_likelihood
  <dbl>          <dbl>
1  0.31      -1225.411
2  0.32      -1225.604
3  0.30      -1226.146
4  0.33      -1226.692
5  0.29      -1227.843
6  0.34      -1228.649
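The table above can be reproduced from the totals (1,972 spins, of which 617 were hits) since this log-likelihood depends on the data only through those sums:

```r
# Log-likelihood (up to the constant binomial coefficient) as a function of theta.
log_lik <- function(theta, hits = 617, spins = 1972) {
  hits * log(theta) + (spins - hits) * log(1 - theta)
}

round(log_lik(c(0.31, 0.30)), 3)   # -1225.411 -1226.146, as in the table
```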

Maud’s Bayesian Answers

  • What is the hit rate? 31.3%
  • What is the RTP? 80.1%
  • What is the hit rate for each winning combination?
    • frequent low paying combinations (every 7 spins for 1x)
    • rare high paying combinations (every 659 spins for 100x)

Monte Carlo (Plain Vanilla Version)

Generate \(M\) independent samples: \(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(M)}\).

  • Need to be able to sample directly from \(p(\theta^{(m)} | y, X)\).

Markov Chain Monte Carlo (MCMC)

Generate a dependent sequence of samples: \(\theta^{(1)}\), \(\theta^{(2)}\), \(\theta^{(3)}\), … \(\theta^{(M)}\).

  • \(\theta^{(m)}\) depends on \(\theta^{(m-1)}\).
  • Need only the transition density \(p(\theta^{(m)} | \theta^{(m-1)}, y, X)\).

ShinyStan

library(shinystan)
launch_shinystan(fit)

Diagnose -> PPcheck (“Posterior predictive check”): compare the distribution of the observed data to replications. Do they look similar? “If our model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.”

Resources